Expected Number of Steps in Binomial Search

This post is inspired by a puzzle from The Riddler. The exact text of the puzzle is as follows:

One of Ollie’s favorite online games is Guess My Word. Each day, there is a secret word, and you try to guess it as efficiently as possible by typing in other words. After each guess, you are told whether the secret word is alphabetically before or after your guess. The game stops and congratulates you when you have guessed the secret word. For example, the secret word was recently “nuance,” which Ollie arrived at with the following series of nine guesses: naan, vacuum, rabbi, papa, oasis, nuclear, nix, noxious, nuance.

Each secret word is randomly chosen from a dictionary with exactly 267,751 entries. If you have this dictionary memorized, and play the game as efficiently as possible, how many guesses should you expect to make to guess the secret word?

When I first read the puzzle, my thoughts directly went to Binomial search algorithm. I would first guess the word right in the dead middle of the dictionary, use that to infer if the word is on the left or on the right, guess another world in the dead middle of the left or right block and so on. By halving the size of the list, I can find any word in $\lceil \log_2 267751 \rceil = 19$ guesses. That is because the worst-case complexity of a binomial search of a list with $N$ elements is $O(\log N)$. This is something anyone doing computer science knows too well.

PNG%20image-F6F845F0C22B-1.png

Another way of representing this problem is with a binary search tree.

PNG%20image-78F27C54AB51-1.png

The worst case complexity of the algorithm is given by the depth of the search tree.

One step further

The puzzle goes one step further. It asks for the expected number of guesses. $\lceil \log_2 N \rceil$ is the number of steps it takes to find the worst-case word. A lot of words take much fewer steps. If the word to be guessed is at the dead center of dictionary, we will have guessed it with just one guess.

So what is the expected number of steps to identify a random word?

A different question

Before we answer that question, let's simplify the problem a little. Suppose I have an array of $N$ memory cells, each identified by the binary represetation of $N$. The memory cells hold words in the same order they appear in the dictionary. You query the game from the puzzle above by guessing a memory cell. The game tells you if you guessed the word correctly and, if not, whether the word is lower in alphabetical ordering.

The first question we need to answer answer is this: how many guesses it takes to identify a word which is at memory location $x$. If we are looking at a memory cell identified by $0111$ out of $15=1111_2$ cells, it takes only one step. Let's enumerate the number of steps it takes to identify .

PNG%20image-33D6E1CAA154-1.png

One observation we make is this : for every guess, you deduce one binary digit of the memory location from the left. The first guess tells you if the first digit is 0 or 1. The second guess tells you if the second digit is 0 or 1. and so on. The second observation we make is this: Every guess we make, has a form **11, i.e. the last digits of the guess is a string of $1$.

The number of guesses we would have to make to reach a number depends on the position of the last 0 in the binary representation. If a number has 0 in its last position, we have to wait until the very end or $\lceil \log N \rceil$ guesses to reach that word. You can see that in the figure above, all the even number need a higher number of steps.

So the conclusion is : the number of guesses to reach $N$th memory entry is the position (from the left) of the last 0 in the binary representation of $N$.

Expected number of guesses

Assume the dictionary has $N$ words. For simplicity, assume $N = 2^n-1$ for some natural number $n > 0$. The probability of picking a word is uniformly $1/N$. There is only one memory cell which has 0 in the first position on the left. There are 2 possibilities for 0 on the second position, 4 for 0 in the third position and so on. The expected number of steps, $T$, is thus given by

$$E[T] = \frac{1}{N} \sum_{k=1}^n k 2^{k-1}$$

Computing this sum is also an interesting problem. Consider the following sum

$$\sum_{k=1}^n k x^{k-1} = \sum_{k=1}^n \frac{\partial}{\partial x}\left( x^k \right) = \frac{\partial}{\partial x}\left( \sum_{k=1}^n x^k \right)$$

The term in side the parenthesis is simply the geometric sum which gives $\frac{x \left(x^n-1\right)}{x-1})$. Taking the derivative of this with respect

$$\frac{\partial}{\partial x} \left(\frac{x \left(x^n-1\right)}{x-1}\right) = \frac{n x^n}{x-1}-\frac{x \left(x^n-1\right)}{(x-1)^2}+\frac{x^n-1}{x-1}$$

Setting $x=2$, we recover our original sum. $$\left. \sum_{k=1}^n k x^{k-1}\right|_{x=2} = \left. \frac{n x^n}{x-1}-\frac{x \left(x^n-1\right)}{(x-1)^2}+\frac{x^n-1}{x-1} \right|_{x=2} = \left( 2^n (n-1)+1\right) $$ With this, we can calculate the expectation $$ E(T) = \frac{1}{2^n-1} \left( 2^n (n-1)+1\right)$$

Getting back to the original puzzle

The tricky part about the puzzles is that $267751 = 1000001010111100111_2$ is not a nice number of the form $2^k$. One way of handling cases like is to use a binary search tree with $2^18 = 262144$ elements and attach the remainder $267751-2^18 = 5607$ as additional embellishments to the original tree. This is often called a decorated binary search tree. For $N=2^4+3 = 18$, we get a search tree that looks like

PNG%20image-303F327D62A8-1.png

For the $2^18$ words in the dictionary, the expected number of steps was calculated in the analysis above. For the remainder, we need an additional step to get a total of 19 steps. The expected number of steps is therefore

$$E(T) = \frac{2^{18}}{267751}\left( \frac{1}{2^{18}-1} \left( 2^{18} (18+1)+1\right) \right) + \frac{5607}{267751}(18 + 1) = 17.0419$$